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Summary. An extension of reproducing l<ernel Hilbert space (RKHS) tlieory provides a new 
frameworl< for modeling functional regression models with functional responses. The approach 
only presumes a general nonlinear regression structure as opposed to previously studied linear 
regression models. Generalized cross-validation (GCV) is proposed for automatic smoothing 
parameter estimation. The new RKHS estimate is applied to both simulated and real data as 
illustrations. 
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1. Introduction 

In many experiments, functional data appear as the basic unit of observations. As a nat- 
ural extension of the multivariate data analysis, functional data analysis provides valuable 
insights into these problems. Compared with the discrete multivariate analysis, functional 
analysis takes into account the smoothness of the high dimensional covariates, and often 
suggests new approaches to the problems that have not been discovered before. Even for 
nonfunctional data, the functional approach can often offer new perspectives on the old 
problem. 

The literature contains an impressive range of functional analysis tools for various prob- 
lems including exploratory functional principal component analysis, canonical correlation 
analysis, classification and regression. Two major approaches exist. The more traditional 
approach, carefully documented in the monograph Ramsay and Silverman ( 20051 ). typically 



starts by representing functional data by an expansion with respect to a certain basis, and 
subsequent inferences are carried out on the coefficients. The most commonly utilized basis 
include B-spline basis for nonperiodic data and Fourier basis for periodic data. Another line 
of work by the French school, taking a nonparametric point of view, extends the traditional 
nonparametric techniques, most notably the kernel estimate, to the functional case. Some 
theoretical results are also obtained as a generalization of the convergence properties of the 
classical kernel estimate. 

We are concerned with the regression problem in this paper. In general, the regression 
problem takes the form 

y, = F{xi) + (1) 

In traditional nonparametric inference, with Xi and yi both being scalars, there exist a 
large number of methods including kernel and locally linear estimation. In functional data 
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analysis, one or more of the components, Xi,yi, and e^, are functions defined on some 
interval, here assumed to be the interval [0, 1]. Inferences are focused on the estimation of 
the structural component F{x), with the residual e modeling the noise or in general the 
component of observations not captured by the model. 

In the case of scalar responses y, at least two nonparametric approaches have appeared 
in the literature. The first method uses a simple kernel regression estimate 

Y.^mXi,x)) 

where d is a semi-metric on the space of functions. The second method is to use the RKHS 
framework, assuming the real valued function F is an element of the RKHS -ff , and the 
estimator is obtained as the niinimizer of the regularized loss 

Y^{y,-F{x,)f + \\\F\\H (2) 
i 

The construction of the Hilbert space in this scalar response case involves no extra technical 
difficulties compared with the classical multivariate case as long as we have a metric on the 
space of functions x, and then the kernel can be constructed with K{xi,X2) = k{d{xi, X2)) 
for any positive definite function k. The representer theorem for RKHS implies that the 
solution to (HI) has the form 

F{x) = ^ aik{x^, x) 

i 

with real coefficients a^. This representation can be plugged back into ^ and solve for 
the coefficients. Note that in both of these two nonparametric approaches, the same word 
"kernel" is used for different concepts, the exact meaning of the word should be clear from 
the context. 

In the case of functional responses y, the classical parametric inferences assume that F 
is linear in x. More explicitly, the pointwise model assumes that 

y{t) ^ a{t) + p{t)x{t) + e{t) 

while the integral model specifies that 

y{t)=a{t)+[ /3{s,t)x{s)ds + e{t) (3) 
"'0 

In contrast to traditional linear regression models, now the coefficient /3 is a function on 
[0, 1], or a bivariate function on [0, 1] x [0, 1]. To estimate the coefficient /?, again a basis is 
chosen to represent the functions involved and the problem is reduced to a multiple linear 
regression model. To our knowledge, nonparametric approaches to functional analysis with 
functional responses has not been studied before and we will embark on this task in the 
current paper. 

We consider the functional response model ([1]) in which the structural component F is 
a mapping from some space of functions to another function space. We assume that y(-) 
belongs to a Hilbert space H. Although it is not necessary to assume that x(-) belong to the 
same space, or even that there is an inner product associated with it, it will be convenient 
to require that x is in the same space as y, as we will assume in the following. 
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In this paper, we will develop an estimation procedure for functional response models 
within the framework of RKHS. For nonlinear models, the parametric approach above does 
not give satisfactory results. Our goal is to show that within the RKHS framework, a simple 
estimate can be developed which reduces to linear regression computations very mu ch like 
those derived in the parametric approach. Our work is motivated by the work of iPreda 
(|2007l ). Unlike the case of scalar response models treated in lPredal (j2007l ). the extension we 
need is more complicated, as we will discuss in Section 2. There we show how we should 
extend the notion and the construction of a RKHS in this new setting and also prove the 
corresponding representer theorem. In section 3, we present our new nonlinear model within 
the framework developed and also comment on the computations involved. Simulation 
studies and application to the well-known weather data are carried out in section 4. These 
results show clear advantage of our model compared to the par ametric linear reg r ession 
mo del in nonlinear contexts . A kernel estimate similar to that of iFerratv and Vieul (j2003l ) 
and iFerratv and Vieul (|200J) are also constructed for comparison. We conclude the paper 
in Section 5. Some technical details are deferred to the appendix. 



2. Reproducing kernel Hilbert spaces 

Following IWahbal (Il990l) . a (real) RKHS H is a Hilbert space of real- valued functions defined 
on, say, the interval [0, 1], in which the point evaluation operator Lt : H R,Lt{f) = f(t) 
is continuous. By Riesz representation theorem, this definition implies the existence of a 
bivariate function K{s, t) such that 

K{s, ■) e iJ, for all s G [0, 1] 

(reproducing property) for every f £ H and t G [0, 1], {K{t, •), /)^ = f{t) (4) 

The definition of a RKHS can actually start from a positive definite bivariate function K{s, t) 
and RKHS is constructed as the completion of the linear span of {K{s,-),s G [0, l]}with 
inner product defined by {K{s, •), K{t, ■))h = K{s, t). 

In the regression model ^ with functional response, we are dealing with the functional 
F taking values in the Hilbert space H. So the RKHS we construct should be a subs et of 
iF : H — > H}. To define the RKHS in this case, we follow the same procedure as in iWahbal 
Il99d) . 

Definition 1. A (functional) RKHS H is a subset of {F : H ^ H}. It is a Hilbert 
space, with inner product (•, ■)u, in which the point evaluation operator is a bounded linear 
operator, i.e., : F —>■ F{x) is a bounded operator from Ti. to H for any x E H . 

The definition above is not useful for constructing a RKHS. For our purpose, we need 
to explicitly define the kernel associated with the space. The continuity of ^ for 
any x £ H is equivalent to the continuity of the mapping F —^ {F{x),g)H for any x £ H 
and g £ H. By Riesz representation theorem applied to the Hilbert space Ti, there exists 
an element in Tl such that 

{KlF)H^{F{x),g)H (5) 

From the above, the mapping g —> is linear. By the boundedness of the point evaluation 
operator, we also have 



m,F)n^{F{x),g)H<C\\F\\n\\g\\H 
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which impUes that the mapping g is also bounded. And so g ^ K^{y) is a bounded 

hnear operator for any y G H, which we can denote by K(x,y), i.e, K(x,y)g :— K^[y)^ 
and we caU K{-,-) the reproducing kernel associated with Ti. Note that in this case, the 
reproducing property is defined by {K{x,-)g,F)-H — {F{x),g)H for any x e H,g e H , this 
is just a rewriting of ([5]). 

Two properties of the reproducing kernel are immediate, (a) K{x, y) = K{y, x)'^ , where 
the superscript T denotes the adjoint operator. This is a simple consequence of the following 
sequence of identities making use of dS]): {K{x,y)g, f) h = {K^{y),f)H = {K^,K^)u = 
{Klix),g)H = {g,K{y,x)f)H- (b) YJ'-=iYTj=i{K{x^,Xj)f^Jj)H > 0, which follows from 

From the above discussions, we have the following definition of a positive definite kernel. 

Definition 2. A (functional) nonnegative definite kernel is a bivariate mapping on 
HxH, taking values in L{H), the space of bounded linear operators from H to itself such 
that 

K{x,y)^K(y,xf 

and 

n n 

Y,Y.^K{x,,Xj)f,J,)H>Q 

i=l i=l 

where {xi\ and {fi} are any two sequences in H . If the double sum above is strictly positive 
whenever {xi} are distinct and {fi} are not all zero, K is a positive definite kernel. 

Given a positive definite kernel defined as above, we can construct a unique RKHS of 
functions on H taking values also in H, with K as its reproducing kernel. The proof of this 
statement follows exactly the same lines as for the real- valued case. 

3. Models for functional data 

We consider the inference and prediction problem for model ([1]). Given the observed func- 
tional covariates and responses {(xi, yi), . . . , (a;„, yn)}, a general approach to estimate F is 
to solve the following minimization problem: 

n 

^^nY,H-F{xMl + M\F\\n (6) 
1=1 

where a penalty term in 7i-norm with smoothing parameter A > is added to the empirical 
risk as is usually done in the smoothing spline regression. We use the simplest loss function 
ll^lli = /o^ ^^(^) here, although other types of loss can certainly be considered. 

The optimization problem above optimizes over the infinite-dimensional space Ti. which 
is not feasible in general. Fortunately, the representer theorem below reduces this problem 
to finite dimensions (if you consider H as a vector space with elements in H acting as 
"constants" ) . The proof of the following is deferred to the appendix. 

Theorem 1. Given the observations {(a^i, yi)}iLi, the solution to has the following 
representation 

n 

F = Y,K{x,,-)a, (7) 

i=l 
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with functional coefficients ai £ H . 

Two difficulties arise at this stage. First, tlie construction of K in general is difficult 
and a search of the literature does not seem to provide us with any clues about how to 
construct a positive definite kernel in general. Second, even if the kernel is constructed, it is 
not clear how to solve ([6]) after we plug in the representation Due to these difficulties, 
we are forced to choose the simplest functional kernel K{x,y) — a(d{x,y))I , where a(-) is 
a real positive definite function, / e L{H) is the identity operator, and d is a metric on H. 
We choose the simplest metric d{x, y) = \\x ~ y\\2 which is also used in IPredal ()2007h . It IS 



unfortunate that we will not be able to provide more complicated examples of the kernel, 
but this estimate works reasonably well in all our experiments. It is clear that K defined 
in this way is nonnegative definite, but to prove that it is positive definite requires a little 
more extra work. We state this result formally as a theorem and its proof is to be found in 
the appendix. 

Theorem 2. The functional kernel K{x,y) — a{\\x — y\\2)I is positive definite if a(-) is 
a (real) positive definite function. 

After this dramatic simplification, we are able to solve problem ([5]). Let Oij — a(||a;i — 
a;j||2) and make use of the representer theorem 1, we arrive at the following optimization 
problem 

n 

min ^ 1 1 2/j - X! 1 1 2 + ^ X! "'J ' "j" ) ^ (8) 

«=1 j i,j 

From this point on, there are definitely more than one way to preceed. For example, 
one can represent each ai by expansion with respect to a chosen basis. We choose instead 
to again compute ^ in the RKHS framework by assuming H is itself a (real) RKHS with 
reproducing kernel k. The loss term is first discretized, assuming the observations are made 
on a regular grid {ii, . . . , tr} on [0, 1], then another application of the representer theorem 
in the real-value case stated in Theorem 3 below reveals the solution to be 

T 

a, = ^6?A;(t,,.) (9) 

and we only need to compute the coefhcients {b]}. Formally, we have 
Theorem 3. Consider the raw data version of 0) 

n T 

miii^ ^[?/i(tO - ^ a^jaj{ti)f + A ^ a^j (ai, a,)H (10) 

i=l 1=1 j i,j 

The solution to ilU\) is of the form (0). 

The readers are referred to the appendix for detailed formula involved in the computa- 
tion. 

The model (fTO|) looks similar to the classical smoothing spline estimation with two 
differences. First, in the first term of pO)) . instead of trying to smooth yhy a single function, 
it tries to approximate each observation yi as a combination of a common set of functions 
{c^j}l=i- The coefficients reflects the similarity of the covariates Xi and xj. Second, 
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an unconventional penalty term J2i j o.ijio-i, otj)H appears, which involves cross-over terms 
(ai,aj)H with i ^ j. The loss term in (1101) seems natural and one could probably come 
up with this term without going through all the trouble of using functional RKHS and 
the derivation above. If this is the case, one would probably use a penalty term such as 
X^i foi' regularization purposes. We call the solution to the problem ([TUl) with 
this simpler penalty the modified RKHS estimate, as opposed to the RKHS estimate which 
solves the original problem (fTU|) . 

In our implementation, we use the Gaussian kernel for both RKHSs, so K{x,y) = 
exp{ — \\x — y\W/2(j'^}I and k{s,t) = exp{ — {s — t)^/2cr'^}. Thus there are at least three 
parameters, a, cr'. A, that need to be specified. For the width parameters in the kernels, we 
simply choose a to be the mean of all \ \xi — Xj\\2,i, j = 1, . . . ,n, and a' is similarly chosen 
to be the mean of \\ti — tj\\2,i,j — 1, . . . ,r. These choices are of course suboptimal but 
produce good results in our experiences, so we do not bother to search over these parameter 
spaces. The choice of the smoothing parameter is arguably m ore important . Generalized 
cross-validation (GCV), which has been extensively studied in Wahba ( 1990l ). can be used 



for the selection of A. Given a grid of values for A specified beforehand, GCV approximates 
the true error by 

V{X)^-\\{I~Ai\))y\\y[-Tr{I-A{X))r 
n n 

where A{X) is the "influence matrix" defined in Appendix B, and the final A is chosen to 

be the one that minimizes V{X). 

A nonpa rametric kerne l estim ate is studied in lFerratv and Vieul ( 2003[ ). lFerratv and Vieu 
(|2004 . and lFerratv et all (|2007l ). which is simply defined as 



^ E.H\\-.~-\\) ^ ^ 

In those papers, the authors only studied the kernel estimate for the model with scalar 
responses, but this estimate can obviously be used when the dependent variable is a curve. 
It also seems natural that we should smooth the response yi before plugging it into (fTTjl if 
the observations are noisy. In the next section, this kernel estimate will be compared with 
our RKHS estimates and the linear parametric estimate. 



4. Examples 

4.1. Simulation study 

One of the main goals of using RKHS estimate is to deal with nonlinearity in the data. 
In this simulation study, we compare four estimates. The RKHS estimate ([T]), its simpler 
version with modified penalty, the linear regression estimate ([3]) and the kernel estimate 
(fTTI) . The simulated data are generated by the following models: 

(a) yit) = Jo \t - s\x{s)ds 

(b) y{t)^J^'\t~s\xHs)ds 

(c) y{t) = siii{2Trt)x{t), t e [0, 1] 

(d) y{t) = cos(7rt)|x(t)|, t e [0, 1] 

Models (a) and (c) are linear in x while (b) and (d) are nonlinear, x(-) is generated as 
a standard Brownian motion with random starting position uniform in [0,5], and y{t) is 
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Table 1 . Results for the simulation study 





RKHS 


Modified RKHS 


Linear 


Oracle Kernel 


Kernel 


Model (a) 


1.000 


0.944 


0.327 


0.910 


2.773 


Model (b) 


1.000 


1.000 


11.632 


1.984 


2.035 


Model (c) 


1.000 


0.992 


0.471 


1.643 


1.954 


Model (d) 


1.000 


1.007 


2.892 


1.459 


1.735 



computed using expressions (a)-(d). An equispaced grid of 50 points on [0, 1] is used. The 
raw observations for the dependent variable are the values of y on the grid points with 
i.i.d. standard normal noise added. The width parameters a and a' in RKHS and modified 
RKHS estimates are set to be the mean of the distances of the covariates from the training 
data as explained in the last section. After some experimentation, we use B-spline basis 
of order 4 with 10 equispaced knots in the linear regression estimate with a penalty 
term involving the second partial derivatives of f3{s,t). The fitting of this linear model is 
performed using the fda package provided in R. A total of n = 30 replicates are used in 
model fitting. Since this is a simulation study, we generate a separate set of 50 replicates 
as validation to choose the smoothing parameter A in all three models, so GCV is not used 
here. For the kernel estimate, we again used a Gaussian kernel, and the only parameter is 
the width parameter inside the kernel. Although it can be fixed as in the RKHS estimator, 
and it indeed produces good results, we instead search over this one-dimensional space using 
the same validation data that was used to choose the smoothing parameter in other models. 
In kernel estimates, we use the raw data as yi in pip . In simulation studies, we can also 
use y{-) in (fTTjl before the standard normal noises are added, which we call the oracle kernel 
estimate for obvious reasons. In real applications, the performance of the kernel estimate 
should be worse than the oracle estimate if some kind of smoothing on the raw dependent 
data are use as a preprocessing step. The search over the width parameter put the kernel 
estimates in a favorable position compared to other estimates. Finally, after the parameters 
are estimated, we generate another 50 observations from model (a)-(d) as test data. These 
simulations are repeated 50 times for each model (a)-(d), and the mean of the mean square 
errors are reported in Table [1] 

In Table [l] the errors for the RKHS estimates are normalized to be 1 , and the errors of 
other estimates are shown as relative to the error for RKHS estimates. For linear models 
(a) and (c), the linear estimate clearly wins. For nonlinear models (b) and (d), the linear 
estimate perform badly compared to other estimates. The performance of the RKHS esti- 
mate and the modified RKHS estimate are almost identical. Although the kernel estimates 
do not perform as good as the RKHS estimates, it deserves further investigations due to its 
low computational complexity. 

Next we study the performance of GCV for estimating the smoothing parameter in the 
RKHS estimate and compare it to the estimate obtained from the validation data, which 
act as the background truth in our simulation study. Figure [T] show that for all four models 
(a)-(d), the GCV correctly identifies a good smoothing parameter to use in these cases. 



4.2. Application to the weather data 

The daily weather data consists of daily temperature and precipitation measurements 
recorded in 35 Canadian weather stations. These data are plotted in Figure [21 To save 
on computations, we subsample the data and use only the weekly measurements, so each 
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Fig. 1 . Comparison of the GCV with the error computed from validation data. The solid curves are 

the GCV estimates, the dashed curves are the error computed from validation set. The curves are 
shifted and normalized to show the shape of the curves rather than its absolute magnitudes. 
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Fig. 2. Daily weather data for 35 Canadian stations, thie curves plotted here result from using smooth- 
ing splines to fit the raw data. 



observation consists of functional data observed on a equispaced grid of 53 points. We treat 
the temperature as the independent variable and the goal is to predict the corresponding 
precipitation curve given the temperature measurements. As is previously done, we set 
the dependent variable to be the log tranformed precipitation measurements, and a small 
positive number is added to the values with precipitation recorded. The prediction of 
our RKHS estimate is shown in Figure [3] for four weather stations. Each plot is produced 
by fitting the model on the other 34 replicates, using GCV to choose the smoothing pa- 
rameter, and then finally calculating the predicted precipitation based on the temperature 
curve. The figure shows re asonable prediction accu r acy an d can be compared to the results 
presented in Chapter 16 of iRamsav and Silverman ( 2005 ). 



5. Conclusions 

We proposed a new approach to fitting a nonlinear functional regression model for functional 
responses. The simulations we conducted demonstrated the clear advantage of the RKHS 
estimate over the linear regression model when the true model is nonlinear. The estimate is 
also better than the simplistic kernel estimates used in traditional nonparametric regression. 

The advantages of the RKHS estimates are tied with the additional computational costs. 
In problem (jlOp . we are optimizing over the same number of curves as there are the number 
of replicates which is computationally difficult when n is large. Approximate solutions such 
as choosing a limited number of kernel basis centered around selected covariates may prove 
to be useful. 
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Fig. 3. Raw data (points) and predictions (solid) of log precipitation for four weather stations. 
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Appendix A Proofs for Theorem 1-3 

Proof of Theorem 1. Denote by Ho the subspace of TC spanned by the kernel centered at 
the observed covariates: Hq — {J2"^iK{xi,-)ai,ai e H}. Any F & H. can be written 
as F = Fq + G, where Fq G Hq and G ± Wo in T^- Then for any j e {1, . . . , n}, and 
any h & H, {G{xj),h)H = {K{xj, ■)h,G)-H = by the reproducing property (O, this 
impHes G{xj) = by the arbitrariness of ft. G -ff. Also, by the orthogonality of G and Fq, 
\\F\\h = \\Fo\\h + \ \G\\h > \\Po\\n if G is nonzero. This implies that 

n n 

J2\\y^- F{x,)\\l + M\F\\n > E " ^"(^Oll^ + M\Fo\\n 

i=l i=l 

if G is nonzero. So the minimizer F belong to Hq 

Proof of Theorem 2. We want to prove that ^ (A'(xi, Xj)/^, > 0. Since {K{xi,Xj)fi, fj)H = 
aij{fi, fj)H, the nonnegativity follows immediately from Schur's Lemma, which asserts that 
the Hadamard product of two nonnegative matrices is again a nonnegative matrix. 

If J2i,j{Ki^t^^j}ftJ3)H = 0, and {xi} are distinct. Let A = {aij},B = {{fiJj)H}- 
A is positive definite and B is nonnegative definite. We have the factorization B = 
E^E. So E^,j{Kix^,x,)f,J,)H = J:^,,A^lB^, = Y.^,J A^^E^ = Y.^,,^^, A^.E^^Ek, = 
Xlfe(Xli j EkiAijEkj). By the positive definiteness of A, we get E — 0, which in turn implies 
fi = Q for all i. 

Proof of Theorem 3. This is similar to the classical proof with smoothing splines. We 
write at = a^^o with a^^o G span{k{ti, •)} and gi in its orthogonal complement. Similar 
to the proof of theorem 1, we only need to show that J2i j '^iji'^ifi + 9i,Oijfl + gj)H > 
Tlii j '^iji'^iflT^jfi) H with equality only if = for all i. The proof of this statement is 
contained in the proof of Theorem 2. 

Appendix B Computational details 

We detail the computations involved in solving the model pUj) . The calculations are very 
similar to those used in linear regression model ([3]) using basis expansion, which is not 
surprising due to the representer theorem. Let A = {fly } and B be the n by T matrix 
containing the coefficients in the expansion ©: B = {h\}. Also, denote the n x T matrix 
{yi{ti)} by Y and T xT matrix {k{ti, tj)} by K. The objective function ITUl) can be written 
in matrix form: 

Tr{{Y - ABK)(Y - ABKf) + \Tr{ABKB^) 
Taking the derivative with respect to B, we want to solve 

{A^ A)B{KK'^) + XABK ^ AYK 

vectorizing the above equation gives us a system of linear equations 

[{KK'^) ® [A^A) + XK(g) A]vec{B) ^ [K ® A)vec{Y) 

So we have the formula 

vec{B) = [K®A + \I]-^vec{Y) 
and the fitted values are Y = ABK, the vectorized version of this is 

vec{Y) ^ {K (g) A)[K (g> A + XI]-'^vec{Y) 
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the matrix ^(A) := {K ^ A)[K ^ A + XI] ^ is the influence matrix used in the calculation 
of GCV. 
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